Dataset statistics
| Number of variables | 10 |
|---|---|
| Number of observations | 20640 |
| Missing cells | 207 |
| Missing cells (%) | 0.1% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 1.6 MiB |
| Average record size in memory | 80.0 B |
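The percentages in this overview are simple ratios. A minimal sketch, assuming a "cell" means one row-column intersection; the helper name `percent` is illustrative and not part of pandas-profiling:

```python
def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

n_rows, n_cols = 20640, 10   # observations and variables from the table above
missing_cells = 207          # per the report's alerts, all in total_bedrooms

# Share of missing cells over the whole table (rows x columns).
missing_cell_pct = percent(missing_cells, n_rows * n_cols)

# Share of missing values within the one affected column (rows only).
missing_in_column_pct = percent(missing_cells, n_rows)

print(f"{missing_cell_pct:.1f}% of cells, {missing_in_column_pct:.1f}% of one column")
```

The same 207 missing values therefore show up as 0.1% at the dataset level and as 1.0% for the single column that contains them.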
Variable types
| Numeric | 9 |
|---|---|
| Categorical | 1 |
longitude is highly overall correlated with latitude and 2 other fields | High correlation |
latitude is highly overall correlated with longitude and 2 other fields | High correlation |
total_rooms is highly overall correlated with total_bedrooms and 2 other fields | High correlation |
total_bedrooms is highly overall correlated with total_rooms and 2 other fields | High correlation |
population is highly overall correlated with total_rooms and 2 other fields | High correlation |
households is highly overall correlated with total_rooms and 2 other fields | High correlation |
median_income is highly overall correlated with median_house_value | High correlation |
median_house_value is highly overall correlated with longitude and 3 other fields | High correlation |
ocean_proximity is highly overall correlated with longitude and 2 other fields | High correlation |
total_bedrooms has 207 (1.0%) missing values | Missing |
Reproduction
| Analysis started | 2022-11-23 09:48:32.556135 |
|---|---|
| Analysis finished | 2022-11-23 09:48:56.788913 |
| Duration | 24.23 seconds |
| Software version | pandas-profiling v3.5.0 |
| Download configuration | config.json |
longitude
Real number (ℝ)
| Distinct | 844 |
|---|---|
| Distinct (%) | 4.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | -119.5697 |
| Minimum | -124.35 |
|---|---|
| Maximum | -114.31 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 20640 |
| Negative (%) | 100.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | -124.35 |
|---|---|
| 5-th percentile | -122.47 |
| Q1 | -121.8 |
| median | -118.49 |
| Q3 | -118.01 |
| 95-th percentile | -117.08 |
| Maximum | -114.31 |
| Range | 10.04 |
| Interquartile range (IQR) | 3.79 |
Descriptive statistics
| Standard deviation | 2.0035317 |
|---|---|
| Coefficient of variation (CV) | -0.016756182 |
| Kurtosis | -1.3301524 |
| Mean | -119.5697 |
| Median Absolute Deviation (MAD) | 1.28 |
| Skewness | -0.29780121 |
| Sum | -2467918.7 |
| Variance | 4.0141394 |
| Monotonicity | Not monotonic |
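Several of the longitude figures above are derived from one another. The sketch below recomputes them from the reported (rounded) values, so small differences in the last digits are expected; the variable names are illustrative:

```python
# Figures copied from the longitude tables above.
mean, std = -119.5697, 2.0035317
minimum, maximum = -124.35, -114.31
q1, q3 = -121.8, -118.01

value_range = maximum - minimum   # spread between the extremes
iqr = q3 - q1                     # spread of the middle 50% of values
variance = std ** 2               # variance is the squared standard deviation
cv = std / mean                   # negative only because the mean is negative

print(round(value_range, 2), round(iqr, 2), round(variance, 4), round(cv, 6))
```

The negative coefficient of variation does not signal a problem: it follows from dividing a positive standard deviation by a mean that lies west of the prime meridian.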
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| -118.31 | 162 | 0.8% |
| -118.3 | 160 | 0.8% |
| -118.29 | 148 | 0.7% |
| -118.27 | 144 | 0.7% |
| -118.32 | 142 | 0.7% |
| -118.28 | 141 | 0.7% |
| -118.35 | 140 | 0.7% |
| -118.36 | 138 | 0.7% |
| -118.19 | 135 | 0.7% |
| -118.37 | 128 | 0.6% |
| Other values (834) | 19202 |
| Value | Count | Frequency (%) |
| -124.35 | 1 | < 0.1% |
| -124.3 | 2 | < 0.1% |
| -124.27 | 1 | < 0.1% |
| -124.26 | 1 | < 0.1% |
| -124.25 | 1 | < 0.1% |
| -124.23 | 3 | |
| -124.22 | 1 | < 0.1% |
| -124.21 | 3 | |
| -124.19 | 4 | |
| -124.18 | 6 |
| Value | Count | Frequency (%) |
| -114.31 | 1 | < 0.1% |
| -114.47 | 1 | < 0.1% |
| -114.49 | 1 | < 0.1% |
| -114.55 | 1 | < 0.1% |
| -114.56 | 1 | < 0.1% |
| -114.57 | 3 | |
| -114.58 | 2 | |
| -114.59 | 2 | |
| -114.6 | 3 | |
| -114.61 | 3 |
latitude
Real number (ℝ)
| Distinct | 862 |
|---|---|
| Distinct (%) | 4.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 35.631861 |
| Minimum | 32.54 |
|---|---|
| Maximum | 41.95 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 32.54 |
|---|---|
| 5-th percentile | 32.82 |
| Q1 | 33.93 |
| median | 34.26 |
| Q3 | 37.71 |
| 95-th percentile | 38.96 |
| Maximum | 41.95 |
| Range | 9.41 |
| Interquartile range (IQR) | 3.78 |
Descriptive statistics
| Standard deviation | 2.1359524 |
|---|---|
| Coefficient of variation (CV) | 0.059945013 |
| Kurtosis | -1.1177598 |
| Mean | 35.631861 |
| Median Absolute Deviation (MAD) | 1.23 |
| Skewness | 0.465953 |
| Sum | 735441.62 |
| Variance | 4.5622926 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 34.06 | 244 | 1.2% |
| 34.05 | 236 | 1.1% |
| 34.08 | 234 | 1.1% |
| 34.07 | 231 | 1.1% |
| 34.04 | 221 | 1.1% |
| 34.09 | 212 | 1.0% |
| 34.02 | 208 | 1.0% |
| 34.1 | 203 | 1.0% |
| 34.03 | 193 | 0.9% |
| 33.93 | 181 | 0.9% |
| Other values (852) | 18477 |
| Value | Count | Frequency (%) |
| 32.54 | 1 | < 0.1% |
| 32.55 | 3 | < 0.1% |
| 32.56 | 10 | < 0.1% |
| 32.57 | 18 | |
| 32.58 | 26 | |
| 32.59 | 11 | |
| 32.6 | 9 | < 0.1% |
| 32.61 | 14 | |
| 32.62 | 13 | |
| 32.63 | 18 |
| Value | Count | Frequency (%) |
| 41.95 | 2 | |
| 41.92 | 1 | < 0.1% |
| 41.88 | 1 | < 0.1% |
| 41.86 | 3 | |
| 41.84 | 1 | < 0.1% |
| 41.82 | 1 | < 0.1% |
| 41.81 | 2 | |
| 41.8 | 3 | |
| 41.79 | 1 | < 0.1% |
| 41.78 | 3 |
housing_median_age
Real number (ℝ)
| Distinct | 52 |
|---|---|
| Distinct (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 28.639486 |
| Minimum | 1 |
|---|---|
| Maximum | 52 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 8 |
| Q1 | 18 |
| median | 29 |
| Q3 | 37 |
| 95-th percentile | 52 |
| Maximum | 52 |
| Range | 51 |
| Interquartile range (IQR) | 19 |
Descriptive statistics
| Standard deviation | 12.585558 |
|---|---|
| Coefficient of variation (CV) | 0.43944774 |
| Kurtosis | -0.80062885 |
| Mean | 28.639486 |
| Median Absolute Deviation (MAD) | 10 |
| Skewness | 0.060330638 |
| Sum | 591119 |
| Variance | 158.39626 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 52 | 1273 | 6.2% |
| 36 | 862 | 4.2% |
| 35 | 824 | 4.0% |
| 16 | 771 | 3.7% |
| 17 | 698 | 3.4% |
| 34 | 689 | 3.3% |
| 26 | 619 | 3.0% |
| 33 | 615 | 3.0% |
| 18 | 570 | 2.8% |
| 25 | 566 | 2.7% |
| Other values (42) | 13153 |
| Value | Count | Frequency (%) |
| 1 | 4 | < 0.1% |
| 2 | 58 | 0.3% |
| 3 | 62 | 0.3% |
| 4 | 191 | |
| 5 | 244 | |
| 6 | 160 | |
| 7 | 175 | |
| 8 | 206 | |
| 9 | 205 | |
| 10 | 264 |
| Value | Count | Frequency (%) |
| 52 | 1273 | |
| 51 | 48 | 0.2% |
| 50 | 136 | 0.7% |
| 49 | 134 | 0.6% |
| 48 | 177 | 0.9% |
| 47 | 198 | 1.0% |
| 46 | 245 | 1.2% |
| 45 | 294 | 1.4% |
| 44 | 356 | 1.7% |
| 43 | 353 | 1.7% |
total_rooms
Real number (ℝ)
| Distinct | 5926 |
|---|---|
| Distinct (%) | 28.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 2635.7631 |
| Minimum | 2 |
|---|---|
| Maximum | 39320 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 2 |
|---|---|
| 5-th percentile | 620.95 |
| Q1 | 1447.75 |
| median | 2127 |
| Q3 | 3148 |
| 95-th percentile | 6213.2 |
| Maximum | 39320 |
| Range | 39318 |
| Interquartile range (IQR) | 1700.25 |
Descriptive statistics
| Standard deviation | 2181.6153 |
|---|---|
| Coefficient of variation (CV) | 0.82769778 |
| Kurtosis | 32.630927 |
| Mean | 2635.7631 |
| Median Absolute Deviation (MAD) | 797 |
| Skewness | 4.1473435 |
| Sum | 54402150 |
| Variance | 4759445.1 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 1527 | 18 | 0.1% |
| 1613 | 17 | 0.1% |
| 1582 | 17 | 0.1% |
| 2127 | 16 | 0.1% |
| 1717 | 15 | 0.1% |
| 2053 | 15 | 0.1% |
| 1607 | 15 | 0.1% |
| 1722 | 15 | 0.1% |
| 1471 | 15 | 0.1% |
| 1703 | 15 | 0.1% |
| Other values (5916) | 20482 |
| Value | Count | Frequency (%) |
| 2 | 1 | < 0.1% |
| 6 | 1 | < 0.1% |
| 8 | 1 | < 0.1% |
| 11 | 1 | < 0.1% |
| 12 | 1 | < 0.1% |
| 15 | 2 | |
| 16 | 1 | < 0.1% |
| 18 | 4 | |
| 19 | 2 | |
| 20 | 2 |
| Value | Count | Frequency (%) |
| 39320 | 1 | |
| 37937 | 1 | |
| 32627 | 1 | |
| 32054 | 1 | |
| 30450 | 1 | |
| 30405 | 1 | |
| 30401 | 1 | |
| 28258 | 1 | |
| 27870 | 1 | |
| 27700 | 1 |
total_bedrooms
Real number (ℝ)
| Distinct | 1923 |
|---|---|
| Distinct (%) | 9.4% |
| Missing | 207 |
| Missing (%) | 1.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 537.87055 |
| Minimum | 1 |
|---|---|
| Maximum | 6445 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 137 |
| Q1 | 296 |
| median | 435 |
| Q3 | 647 |
| 95-th percentile | 1275.4 |
| Maximum | 6445 |
| Range | 6444 |
| Interquartile range (IQR) | 351 |
Descriptive statistics
| Standard deviation | 421.38507 |
|---|---|
| Coefficient of variation (CV) | 0.78343213 |
| Kurtosis | 21.985575 |
| Mean | 537.87055 |
| Median Absolute Deviation (MAD) | 162 |
| Skewness | 3.4595463 |
| Sum | 10990309 |
| Variance | 177565.38 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 280 | 55 | 0.3% |
| 331 | 51 | 0.2% |
| 345 | 50 | 0.2% |
| 343 | 49 | 0.2% |
| 393 | 49 | 0.2% |
| 348 | 48 | 0.2% |
| 394 | 48 | 0.2% |
| 328 | 48 | 0.2% |
| 309 | 47 | 0.2% |
| 272 | 47 | 0.2% |
| Other values (1913) | 19941 | |
| (Missing) | 207 | 1.0% |
| Value | Count | Frequency (%) |
| 1 | 1 | < 0.1% |
| 2 | 2 | < 0.1% |
| 3 | 5 | |
| 4 | 7 | |
| 5 | 6 | |
| 6 | 5 | |
| 7 | 6 | |
| 8 | 8 | |
| 9 | 7 | |
| 10 | 8 |
| Value | Count | Frequency (%) |
| 6445 | 1 | |
| 6210 | 1 | |
| 5471 | 1 | |
| 5419 | 1 | |
| 5290 | 1 | |
| 5033 | 1 | |
| 5027 | 1 | |
| 4957 | 1 | |
| 4952 | 1 | |
| 4819 | 1 |
population
Real number (ℝ)
| Distinct | 3888 |
|---|---|
| Distinct (%) | 18.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 1425.4767 |
| Minimum | 3 |
|---|---|
| Maximum | 35682 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 3 |
|---|---|
| 5-th percentile | 348 |
| Q1 | 787 |
| median | 1166 |
| Q3 | 1725 |
| 95-th percentile | 3288 |
| Maximum | 35682 |
| Range | 35679 |
| Interquartile range (IQR) | 938 |
Descriptive statistics
| Standard deviation | 1132.4621 |
|---|---|
| Coefficient of variation (CV) | 0.79444447 |
| Kurtosis | 73.553116 |
| Mean | 1425.4767 |
| Median Absolute Deviation (MAD) | 440 |
| Skewness | 4.9358582 |
| Sum | 29421840 |
| Variance | 1282470.5 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 891 | 25 | 0.1% |
| 761 | 24 | 0.1% |
| 1227 | 24 | 0.1% |
| 1052 | 24 | 0.1% |
| 850 | 24 | 0.1% |
| 825 | 23 | 0.1% |
| 782 | 22 | 0.1% |
| 999 | 22 | 0.1% |
| 1005 | 22 | 0.1% |
| 753 | 21 | 0.1% |
| Other values (3878) | 20409 |
| Value | Count | Frequency (%) |
| 3 | 1 | < 0.1% |
| 5 | 1 | < 0.1% |
| 6 | 1 | < 0.1% |
| 8 | 4 | |
| 9 | 2 | |
| 11 | 1 | < 0.1% |
| 13 | 4 | |
| 14 | 3 | |
| 15 | 2 | |
| 17 | 2 |
| Value | Count | Frequency (%) |
| 35682 | 1 | |
| 28566 | 1 | |
| 16305 | 1 | |
| 16122 | 1 | |
| 15507 | 1 | |
| 15037 | 1 | |
| 13251 | 1 | |
| 12873 | 1 | |
| 12427 | 1 | |
| 12203 | 1 |
households
Real number (ℝ)
| Distinct | 1815 |
|---|---|
| Distinct (%) | 8.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 499.53968 |
| Minimum | 1 |
|---|---|
| Maximum | 6082 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 125 |
| Q1 | 280 |
| median | 409 |
| Q3 | 605 |
| 95-th percentile | 1162 |
| Maximum | 6082 |
| Range | 6081 |
| Interquartile range (IQR) | 325 |
Descriptive statistics
| Standard deviation | 382.32975 |
|---|---|
| Coefficient of variation (CV) | 0.76536413 |
| Kurtosis | 22.057988 |
| Mean | 499.53968 |
| Median Absolute Deviation (MAD) | 151 |
| Skewness | 3.4104377 |
| Sum | 10310499 |
| Variance | 146176.04 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 306 | 57 | 0.3% |
| 386 | 56 | 0.3% |
| 335 | 56 | 0.3% |
| 282 | 55 | 0.3% |
| 429 | 54 | 0.3% |
| 375 | 53 | 0.3% |
| 284 | 51 | 0.2% |
| 297 | 51 | 0.2% |
| 278 | 50 | 0.2% |
| 340 | 50 | 0.2% |
| Other values (1805) | 20107 |
| Value | Count | Frequency (%) |
| 1 | 1 | < 0.1% |
| 2 | 3 | < 0.1% |
| 3 | 4 | < 0.1% |
| 4 | 4 | < 0.1% |
| 5 | 7 | |
| 6 | 5 | |
| 7 | 10 | |
| 8 | 8 | |
| 9 | 9 | |
| 10 | 7 |
| Value | Count | Frequency (%) |
| 6082 | 1 | |
| 5358 | 1 | |
| 5189 | 1 | |
| 5050 | 1 | |
| 4930 | 1 | |
| 4855 | 1 | |
| 4769 | 1 | |
| 4616 | 1 | |
| 4490 | 1 | |
| 4372 | 1 |
median_income
Real number (ℝ)
| Distinct | 12928 |
|---|---|
| Distinct (%) | 62.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 3.870671 |
| Minimum | 0.4999 |
|---|---|
| Maximum | 15.0001 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 0.4999 |
|---|---|
| 5-th percentile | 1.60057 |
| Q1 | 2.5634 |
| median | 3.5348 |
| Q3 | 4.74325 |
| 95-th percentile | 7.300305 |
| Maximum | 15.0001 |
| Range | 14.5002 |
| Interquartile range (IQR) | 2.17985 |
Descriptive statistics
| Standard deviation | 1.8998217 |
|---|---|
| Coefficient of variation (CV) | 0.4908249 |
| Kurtosis | 4.9525241 |
| Mean | 3.870671 |
| Median Absolute Deviation (MAD) | 1.0642 |
| Skewness | 1.6466567 |
| Sum | 79890.649 |
| Variance | 3.6093226 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 3.125 | 49 | 0.2% |
| 15.0001 | 49 | 0.2% |
| 2.875 | 46 | 0.2% |
| 2.625 | 44 | 0.2% |
| 4.125 | 44 | 0.2% |
| 3.875 | 41 | 0.2% |
| 3.375 | 38 | 0.2% |
| 3 | 38 | 0.2% |
| 4 | 37 | 0.2% |
| 3.625 | 37 | 0.2% |
| Other values (12918) | 20217 |
| Value | Count | Frequency (%) |
| 0.4999 | 12 | |
| 0.536 | 10 | |
| 0.5495 | 1 | < 0.1% |
| 0.6433 | 1 | < 0.1% |
| 0.6775 | 1 | < 0.1% |
| 0.6825 | 1 | < 0.1% |
| 0.6831 | 1 | < 0.1% |
| 0.696 | 1 | < 0.1% |
| 0.6991 | 1 | < 0.1% |
| 0.7007 | 1 | < 0.1% |
| Value | Count | Frequency (%) |
| 15.0001 | 49 | |
| 15 | 2 | < 0.1% |
| 14.9009 | 1 | < 0.1% |
| 14.5833 | 1 | < 0.1% |
| 14.4219 | 1 | < 0.1% |
| 14.4113 | 1 | < 0.1% |
| 14.2959 | 1 | < 0.1% |
| 14.2867 | 1 | < 0.1% |
| 13.947 | 1 | < 0.1% |
| 13.8556 | 1 | < 0.1% |
median_house_value
Real number (ℝ)
| Distinct | 3842 |
|---|---|
| Distinct (%) | 18.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 206855.82 |
| Minimum | 14999 |
|---|---|
| Maximum | 500001 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 161.4 KiB |
Quantile statistics
| Minimum | 14999 |
|---|---|
| 5-th percentile | 66200 |
| Q1 | 119600 |
| median | 179700 |
| Q3 | 264725 |
| 95-th percentile | 489810 |
| Maximum | 500001 |
| Range | 485002 |
| Interquartile range (IQR) | 145125 |
Descriptive statistics
| Standard deviation | 115395.62 |
|---|---|
| Coefficient of variation (CV) | 0.55785531 |
| Kurtosis | 0.32787024 |
| Mean | 206855.82 |
| Median Absolute Deviation (MAD) | 68400 |
| Skewness | 0.97776327 |
| Sum | 4.2695041 × 10^9 |
| Variance | 1.3316148 × 10^10 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 500001 | 965 | 4.7% |
| 137500 | 122 | 0.6% |
| 162500 | 117 | 0.6% |
| 112500 | 103 | 0.5% |
| 187500 | 93 | 0.5% |
| 225000 | 92 | 0.4% |
| 350000 | 79 | 0.4% |
| 87500 | 78 | 0.4% |
| 275000 | 65 | 0.3% |
| 150000 | 64 | 0.3% |
| Other values (3832) | 18862 |
| Value | Count | Frequency (%) |
| 14999 | 4 | |
| 17500 | 1 | < 0.1% |
| 22500 | 4 | |
| 25000 | 1 | < 0.1% |
| 26600 | 1 | < 0.1% |
| 26900 | 1 | < 0.1% |
| 27500 | 1 | < 0.1% |
| 28300 | 1 | < 0.1% |
| 30000 | 2 | |
| 32500 | 4 |
| Value | Count | Frequency (%) |
| 500001 | 965 | |
| 500000 | 27 | 0.1% |
| 499100 | 1 | < 0.1% |
| 499000 | 1 | < 0.1% |
| 498800 | 1 | < 0.1% |
| 498700 | 1 | < 0.1% |
| 498600 | 1 | < 0.1% |
| 498400 | 1 | < 0.1% |
| 497600 | 1 | < 0.1% |
| 497400 | 1 | < 0.1% |
ocean_proximity
Categorical
| Distinct | 5 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 161.4 KiB |
| <1H OCEAN | 9136 |
|---|---|
| INLAND | 6551 |
| NEAR OCEAN | 2658 |
| NEAR BAY | 2290 |
| ISLAND | 5 |
Length
| Max length | 10 |
|---|---|
| Median length | 9 |
| Mean length | 8.0649225 |
| Min length | 6 |
Characters and Unicode
| Total characters | 166460 |
|---|---|
| Distinct characters | 16 |
| Distinct categories | 4 |
| Distinct scripts | 2 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | NEAR BAY |
|---|---|
| 2nd row | NEAR BAY |
| 3rd row | NEAR BAY |
| 4th row | NEAR BAY |
| 5th row | NEAR BAY |
Common Values
| Value | Count | Frequency (%) |
| <1H OCEAN | 9136 | 44.3% |
| INLAND | 6551 | 31.7% |
| NEAR OCEAN | 2658 | 12.9% |
| NEAR BAY | 2290 | 11.1% |
| ISLAND | 5 | < 0.1% |
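The length and character totals reported for this variable can be cross-checked from the category counts alone. A minimal sketch using only the Python standard library; the variable names are illustrative:

```python
from collections import Counter
from statistics import median

# Category counts copied from the Common Values table above.
counts = {
    "<1H OCEAN": 9136,
    "INLAND": 6551,
    "NEAR OCEAN": 2658,
    "NEAR BAY": 2290,
    "ISLAND": 5,
}

n_rows = sum(counts.values())

# Total characters: each label's length weighted by how often it occurs.
total_chars = sum(len(label) * n for label, n in counts.items())
mean_length = total_chars / n_rows

# Median length: expand to one length per row, then take the middle value.
lengths = [len(label) for label, n in counts.items() for _ in range(n)]
median_length = median(lengths)

# Per-character tallies, again weighted by category frequency.
char_counts = Counter()
for label, n in counts.items():
    for ch in label:
        char_counts[ch] += n

print(n_rows, total_chars, round(mean_length, 7), median_length)
print(len(char_counts), char_counts.most_common(3))
```

This reproduces the 166460 total characters, the mean length of 8.0649225, the median length of 9, and the 16 distinct characters listed in the report.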
Length
Histogram of lengths of the category
Common Values (Plot)
| Value | Count | Frequency (%) |
| ocean | 11794 | 34.0% |
| 1h | 9136 | 26.3% |
| inland | 6551 | 18.9% |
| near | 4948 | 14.2% |
| bay | 2290 | 6.6% |
| island | 5 | < 0.1% |
Most occurring characters
| Value | Count | Frequency (%) |
| N | 29849 | 17.9% |
| A | 25588 | 15.4% |
| E | 16742 | 10.1% |
| (space) | 14084 | 8.5% |
| O | 11794 | 7.1% |
| C | 11794 | 7.1% |
| < | 9136 | 5.5% |
| 1 | 9136 | 5.5% |
| H | 9136 | 5.5% |
| I | 6556 | 3.9% |
| Other values (6) | 22645 | 13.6% |
Most occurring categories
| Value | Count | Frequency (%) |
| Uppercase Letter | 134104 | 80.6% |
| Space Separator | 14084 | 8.5% |
| Math Symbol | 9136 | 5.5% |
| Decimal Number | 9136 | 5.5% |
Most frequent character per category
Uppercase Letter
| Value | Count | Frequency (%) |
| N | 29849 | 22.3% |
| A | 25588 | 19.1% |
| E | 16742 | 12.5% |
| O | 11794 | 8.8% |
| C | 11794 | 8.8% |
| H | 9136 | 6.8% |
| I | 6556 | 4.9% |
| L | 6556 | 4.9% |
| D | 6556 | 4.9% |
| R | 4948 | 3.7% |
| Other values (3) | 4585 | 3.4% |
Space Separator
| Value | Count | Frequency (%) |
| (space) | 14084 | 100.0% |
Math Symbol
| Value | Count | Frequency (%) |
| < | 9136 | 100.0% |
Decimal Number
| Value | Count | Frequency (%) |
| 1 | 9136 | 100.0% |
Most occurring scripts
| Value | Count | Frequency (%) |
| Latin | 134104 | 80.6% |
| Common | 32356 | 19.4% |
Most frequent character per script
Latin
| Value | Count | Frequency (%) |
| N | 29849 | 22.3% |
| A | 25588 | 19.1% |
| E | 16742 | 12.5% |
| O | 11794 | 8.8% |
| C | 11794 | 8.8% |
| H | 9136 | 6.8% |
| I | 6556 | 4.9% |
| L | 6556 | 4.9% |
| D | 6556 | 4.9% |
| R | 4948 | 3.7% |
| Other values (3) | 4585 | 3.4% |
Common
| Value | Count | Frequency (%) |
| (space) | 14084 | 43.5% |
| < | 9136 | 28.2% |
| 1 | 9136 | 28.2% |
Most occurring blocks
| Value | Count | Frequency (%) |
| ASCII | 166460 | 100.0% |
Most frequent character per block
ASCII
| Value | Count | Frequency (%) |
| N | 29849 | 17.9% |
| A | 25588 | 15.4% |
| E | 16742 | 10.1% |
| (space) | 14084 | 8.5% |
| O | 11794 | 7.1% |
| C | 11794 | 7.1% |
| < | 9136 | 5.5% |
| 1 | 9136 | 5.5% |
| H | 9136 | 5.5% |
| I | 6556 | 3.9% |
| Other values (6) | 22645 | 13.6% |
Auto
The auto setting is an interpretable pairwise column metric with the following mapping (Variable_type-Variable_type : Method, Range):
- Categorical-Categorical : Cramer's V, [0,1]
- Numerical-Categorical : Cramer's V, [0,1] (using a discretized numerical column)
- Numerical-Numerical : Spearman's ρ, [-1,1]
This configuration uses the recommended metric for each pair of columns.
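The mapping above amounts to a small lookup on the pair of column types. The following is a sketch of that selection logic only, not pandas-profiling's own code; the function name `auto_metric` and the type labels are assumptions made for illustration:

```python
def auto_metric(type_a: str, type_b: str) -> tuple:
    """Pick a correlation method and its value range for a pair of column types."""
    kinds = {type_a, type_b}
    if kinds == {"Numerical"}:
        return "Spearman's rho", (-1, 1)
    # Any pair involving a categorical column uses Cramer's V. A numerical
    # partner would first be discretized, which this sketch does not do.
    if kinds <= {"Numerical", "Categorical"}:
        return "Cramer's V", (0, 1)
    raise ValueError(f"unsupported type pair: {type_a!r}, {type_b!r}")

print(auto_metric("Numerical", "Numerical"))
print(auto_metric("Numerical", "Categorical"))
print(auto_metric("Categorical", "Categorical"))
```

Because the two methods have different ranges, values in an auto correlation matrix are not directly comparable across numeric-numeric and categorical pairs: only Spearman's ρ can be negative.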
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
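The rank-based calculation can be sketched in plain Python. This is an illustrative implementation under the assumption that ties receive the average of their rank positions; the function names are hypothetical, and a real analysis would normally rely on a statistics library:

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Covariance of the rank variables over the product of their standard deviations."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    # The 1/n factors of the covariance and standard deviations cancel out.
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]   # nonlinear but strictly increasing
print(spearman_rho(x, y))
```

Because only the ordering of the values matters, the cubic relationship yields a coefficient of 1 up to floating-point rounding. A constant column has zero rank variance, so this sketch would raise a division error for it.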
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
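A short sketch of this calculation, together with a check of the invariance property. The function name and the sample data are illustrative only; the invariance shown here uses positive scale factors, since a negative factor would flip the sign of r:

```python
from math import sqrt

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # The 1/n factors of the covariance and standard deviations cancel out.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x, y)

# Shift and rescale each variable separately; r should not change.
r_transformed = pearson_r([10 * a + 3 for a in x], [0.5 * b - 7 for b in y])

print(r, r_transformed)
```

Both calls give the same coefficient up to floating-point rounding, which is why the unit of measurement (for example dollars versus thousands of dollars) does not affect r.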
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
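The pair-counting definition translates directly into code. The sketch below implements the formula exactly as worded, which corresponds to the tau-a variant; statistical libraries often apply a tie-corrected variant instead, so results may differ when ties are present. The function name is illustrative:

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1      # both variables move in the same direction
        elif product < 0:
            discordant += 1      # the variables move in opposite directions
        # A product of zero is a tie and counts as neither.
    return (concordant - discordant) / len(pairs)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(kendall_tau_a(x, y))
```

In this example 8 of the 10 pairs are concordant and 2 are discordant, giving τ = 0.6. The quadratic number of pairs makes this direct approach slow on large columns.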
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available for the method.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
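The data behind these two missing-value views can be sketched without any plotting. The records below are toy values, not rows from the dataset, and the function names are hypothetical:

```python
def nullity_by_column(rows, columns):
    """Count missing (None) entries per column."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}

def nullity_matrix(rows, columns):
    """One list of booleans per record: True where the value is present."""
    return [[row.get(col) is not None for col in columns] for row in rows]

columns = ["total_rooms", "total_bedrooms"]

# Toy records for illustration only; None marks a missing value.
rows = [
    {"total_rooms": 100.0, "total_bedrooms": 20.0},
    {"total_rooms": 250.0, "total_bedrooms": None},
    {"total_rooms": 300.0, "total_bedrooms": 60.0},
]

missing = nullity_by_column(rows, columns)
matrix = nullity_matrix(rows, columns)

print(missing)
for line in matrix:
    # Render present cells as a filled block and missing cells as a gap.
    print("".join("#" if present else "." for present in line))
```

The per-column counts feed a bar-style view, while the boolean matrix is what a nullity plot draws record by record, making runs of missing values visible at a glance.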
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |
| 6 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 | 514.0 | 3.6591 | 299200.0 | NEAR BAY |
| 7 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 | 647.0 | 3.1200 | 241400.0 | NEAR BAY |
| 8 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 | 595.0 | 2.0804 | 226700.0 | NEAR BAY |
| 9 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 | 714.0 | 3.6912 | 261100.0 | NEAR BAY |
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20630 | -121.32 | 39.29 | 11.0 | 2640.0 | 505.0 | 1257.0 | 445.0 | 3.5673 | 112000.0 | INLAND |
| 20631 | -121.40 | 39.33 | 15.0 | 2655.0 | 493.0 | 1200.0 | 432.0 | 3.5179 | 107200.0 | INLAND |
| 20632 | -121.45 | 39.26 | 15.0 | 2319.0 | 416.0 | 1047.0 | 385.0 | 3.1250 | 115600.0 | INLAND |
| 20633 | -121.53 | 39.19 | 27.0 | 2080.0 | 412.0 | 1082.0 | 382.0 | 2.5495 | 98300.0 | INLAND |
| 20634 | -121.56 | 39.27 | 28.0 | 2332.0 | 395.0 | 1041.0 | 344.0 | 3.7125 | 116800.0 | INLAND |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |